Final Project

Data Science for Public Policy

Author

Madeleine Adelson, Stephanie Jamilla, Jamie Jelly Murtha

About This Project

Urban environments tend to retain heat because they have less tree and vegetation cover, more impervious surfaces such as asphalt, and other heat-trapping features. Using data on DC's land cover, we built and tested three different models to predict where heat hotspots are most likely to occur within the District. We also included several variables from the American Community Survey (ACS) to test whether demographic characteristics are predictive of hotspots. Our analysis is at the block group level.

Setup

Loading packages:

library(dotenv)
library(tidyverse)
library(ggplot2)
library(sf)
library(tidycensus)
library(stringr)
library(magrittr)
library(jsonlite)
library(httr)
library(tidymodels)
library(knitr)
library(kableExtra)
library(data.table)
library(patchwork)
library(vip)

# Clone the private github repository for this assignment using SSH link:
# git@github.com:stejamilla/final-project.git

# do not use scientific notation
options(scipen = 999)

Loading and Cleaning the Data

DC Land Cover Data

Loading data from Open Data DC (landcover, ward geometry):

# Loading Urban Tree Canopy by Census Block Group (2020) data
landcover <- read_sf(paste0("data/Urban_Tree_Canopy",
                            "_by_Census_Block_Group_in_2020.shp")) |>
  rename(block_group = GEOID) |>
  mutate(block_group = as.character(block_group)) |>
  mutate(OBJECTID = as.character(OBJECTID)) |>
  rename_all(tolower) |>
  relocate(block_group, .after = objectid) |>
  # Converting UHI to Fahrenheit
  mutate(uhi = (uhi*(9/5) + 32))
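As a sanity check on the Celsius-to-Fahrenheit conversion applied in the mutate() above (the helper function name below is ours, not part of the pipeline):

```r
# Same formula as in the mutate() above, pulled out as a helper
c_to_f <- function(celsius) celsius * (9 / 5) + 32

# A raw UHI reading of 32.1 C converts to 89.78 F,
# consistent with the ~89.8 F mean UHI reported later
c_to_f(32.1)
```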

# Loading DC Ward data from Urban Tree Canopy by Ward (2020)
wards <- read_sf("data/Urban_Tree_Canopy_by_Ward_in_2020.shp") |>
  rename_all(tolower) |>
  select(ward) |>
  st_transform(crs = st_crs(landcover))

Create key with Urban Tree Canopy data column descriptions and clean up:

landcover_key <- read_csv("data/Urban_Tree_Canopy_Descriptions.csv",
                          col_names = FALSE) |>
  mutate(X1 = str_remove_all(X1, "\\( type:.*alias:|, length:.*| \\)")) |>
  separate(col = X1,
           into = c("field", "description"),
           sep = " ",
           extra = "merge") |>
  mutate(field = tolower(field))
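To make the str_remove_all() pattern above concrete, here is the same regex applied via base-R gsub() to a hypothetical metadata line in the Open Data DC style (the sample string is invented):

```r
# Invented description line mimicking the Urban_Tree_Canopy_Descriptions.csv format
x <- "GEOID ( type: String, alias: GEOID, length: 12 )"

# Same pattern as above: strip "( type: ... alias:" and ", length: ..." onward,
# leaving the field name followed by its alias (the human-readable description)
gsub("\\( type:.*alias:|, length:.*| \\)", "", x)
```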

Categorizing each block group into their respective ward:

# Spatial join so that each block group is assigned a ward based on majority
# area of a block group within a ward
landcover <- st_join(landcover, wards, largest = TRUE)

# Mapping the overlapping wards & block groups
ggplot() +
  geom_sf(data = landcover,
          aes(fill = ward))

STEPHANIE 12/13 - this code assigns each block group to the ward it intersects. For block groups that overlap multiple wards, it returns the ward the block group shares the most area with (that's the largest = TRUE argument).

Exploring Census Demographic Data

Load ACS 5-yr 2016-2020 variables:

acs_variables_bg <- load_variables(2020, "acs5") |>
  filter(geography == "block group")

MADELEINE PRE-12/12 - It occurs to me we may need to choose baseline categories for any categorical variables we include and leave those out? Not sure if that is only a requirement for linear regression?

JAMIE 12/12 - Good point! I'm adding stuff in below without factoring that in for now, but I think we definitely need to determine whether we have to deal with this!

STEPHANIE 12/13 - I'm not too concerned about this because we've never had to address categorical variables in the assignments. We can use something like step_dummy() if we want to recode a categorical variable into dummies instead.

Loading Census demographic data via API

# Secure Census API key (from .env file) for use in Census API call
load_dot_env()
credential <- Sys.getenv("census_api_key")

# Create url for API call
url <- str_glue(paste0("https://api.census.gov/data/2020/acs/acs5?get=",
             # Description of data point
             "NAME,",
             # TOTAL POPULATION ESTIMATE
             "B01003_001E,",
             # TOTAL POPULATION BY RACE - WHITE ALONE
             "B02001_002E,",
             # TOTAL POPULATION BY RACE - BLACK OR AFR AMER ALONE
             "B02001_003E,",
             # TOTAL POPULATION BY RACE - AMER IND AND AK NAT ALONE
             "B02001_004E,",
             # TOTAL POPULATION BY RACE - ASIAN ALONE
             "B02001_005E,",
             # TOTAL POPULATION BY RACE - NAT HAW OR PAC ISL ALONE
             "B02001_006E,",
             # TOTAL POPULATION BY RACE - OTHER ALONE + TWO OR MORE RACES
             "B02001_008E,",
             # TOTAL HISPANIC OR LATINO
             "B03003_003E,",
             # TOTAL POP WITH DOCTORATE DEGREE
             "B15003_025E,",
             # TOTAL POP WITH PROFESSIONAL SCHOOL DEGREE
             "B15003_024E,",
             # TOTAL POP WITH MASTER'S DEGREE
             "B15003_023E,",
             # TOTAL POP WITH BACHELORS DEGREE
             "B15003_022E,",
             # TOTAL POP WITH ASSOCIATES DEGREE
             "B15003_021E,",
             # TOTAL POP WITH REGULAR HIGH SCHOOL DIPLOMA
             "B15003_017E,",
             # TOTAL POP WITH GED OR ALTERNATIVE CREDENTIAL
             "B15003_018E,",
             # MEDIAN HH INCOME IN PAST 12 MOS
             "B19013_001E,",
             # TOTAL HOUSING UNITS
             "B25002_001E,",
             # TOTAL OCCUPIED HOUSING UNITS
             "B25002_002E,",
             # TOTAL VACANT HOUSING UNITS
             "B25002_003E,",
             # MEDIAN PROPERTY VALUE
             "B25077_001E,",
             # GROSS RENT AS A % OF HH INCOME IN PAST 12 MO -
             # TOTAL
             "B25070_001E,",
             # GROSS RENT AS A % OF HH INCOME IN PAST 12 MO -
             # 30 TO 34.9 %
             "B25070_007E,",
             # GROSS RENT AS A % OF HH INCOME IN PAST 12 MO -
             # 35 TO 39.9 %
             "B25070_008E,",
             # GROSS RENT AS A % OF HH INCOME IN PAST 12 MO -
             # 40 TO 49.9 %
             "B25070_009E,",
             # GROSS RENT AS A % OF HH INCOME IN PAST 12 MO -
             # 50 % or more
             "B25070_010E",
             # Limit to block-group-level data for DC
             "&for=block%20group:*&for=tract:*&in=state:11&in=county:001"))

# Use url to request data from API
acs_json <- GET(url = url)

# Check call status
http_status(acs_json)
$category
[1] "Success"

$reason
[1] "OK"

$message
[1] "Success: (200) OK"
# Save API JSON response as text
acs_json <- content(acs_json, as = "text")

# Save JSON as character matrix
acs_matrix <- fromJSON(acs_json)

# Convert matrix to tibble
acs_data <- as_tibble(acs_matrix[2:nrow(acs_matrix), ],
                      .name_repair = "minimal")

# Add variable names to tibble
names(acs_data) <- acs_matrix[1, ] |>
  str_replace(" ", "_")
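The matrix-to-tibble steps above follow the usual pattern for Census API responses: row 1 of the JSON matrix holds the variable names and the remaining rows hold the data. A base-R sketch of that pattern, with made-up values:

```r
# Toy stand-in for the character matrix returned by fromJSON()
m <- rbind(c("NAME", "B01003_001E"),
           c("Block Group 1", "1204"),
           c("Block Group 2", "987"))

# Drop the header row into the column names, as done with acs_matrix above
df <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(df) <- m[1, ]
df
```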

Labeling and cleaning ACS variables:

# Clean up ACS data
acs_data <- acs_data |>
  # Meaningful variable names
  rename(total_pop = B01003_001E,
         race_white = B02001_002E,
         race_black = B02001_003E,
         race_ai_an = B02001_004E,
         race_asian = B02001_005E,
         race_nh_pi = B02001_006E,
         race_oth_mult = B02001_008E,
         race_eth_hisp = B03003_003E,
         ed_doct = B15003_025E,
         ed_prof_deg = B15003_024E,
         ed_master = B15003_023E,
         ed_bach = B15003_022E,
         ed_assoc = B15003_021E,
         ed_ged = B15003_018E,
         ed_reg_hsd = B15003_017E,
         med_hhi = B19013_001E,
         med_prop_val = B25077_001E,
         rent_burden_tot = B25070_001E,
         rent_burden_30_34 = B25070_007E,
         rent_burden_35_39 = B25070_008E,
         rent_burden_40_49 = B25070_009E,
         rent_burden_50plus = B25070_010E,
         total_hous_units = B25002_001E,
         vac_units = B25002_003E,
         occ_units = B25002_002E,
         block_group_temp = block_group) |>
  # Create block_group column to use for combining this with the other datasets
  mutate(block_group = paste(state,
                             county, 
                             tract, 
                             block_group_temp,
                             sep = "")) |>
  # Convert Census variables to numeric
  mutate_at(vars(total_pop:rent_burden_50plus), as.numeric) |>
  rename_all(tolower) |>
  relocate(block_group, .before = name)


# Convert rent burden variables into a percentage of rent burdened households,
# then dropping original rent burden vars
acs_data <- acs_data |> 
  mutate(
    rent_burden_sum = 
      rent_burden_30_34 + 
       rent_burden_35_39 + 
       rent_burden_40_49 + 
       rent_burden_50plus
    ) |>
  mutate(
    rent_burden_pct = (rent_burden_sum / rent_burden_tot)
    ) |>
  select(-c(rent_burden_30_34, 
            rent_burden_35_39, 
            rent_burden_40_49, 
            rent_burden_50plus,
            rent_burden_tot,
            rent_burden_sum)
         )
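The rent-burden share above treats a household as rent burdened when gross rent is 30% or more of household income. With invented counts, the arithmetic looks like this:

```r
# Hypothetical block group: 200 renter households reporting rent burden
rent_burden_tot <- 200
burdened <- c(rb_30_34 = 25, rb_35_39 = 15, rb_40_49 = 10, rb_50plus = 10)

# Same calculation as rent_burden_pct above: burdened households / total
rent_burden_pct <- sum(burdened) / rent_burden_tot
rent_burden_pct  # 60 / 200 = 0.3
```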

Data Imputation Part 1

IMPUTATION: The imputation logic works as follows:

  1. If data is missing at the block group level, use the mean for that block group's census tract.

  2. If data is still missing at the census tract level, use the mean for that census tract's ward. (This happens further down in the code.)

This method is imperfect for two reasons: 1) Imputed values at the census tract level feed into imputation at the ward level. For instance, say a ward is made up of 4 census tracts and each tract has 4 block groups. Initially, only 2 of the 4 tracts have complete data. After imputing at the tract level, 3 of 4 are complete. When we impute the data for the final tract, that calculation will include the values we imputed for the other missing tract. I don't think this is a big deal, especially because the number of values that need to be imputed at the ward level is pretty small.

2) Median household income and median property value are top- and bottom-coded (no recorded income falls below $2,499 or above $250,001). The only way around this I can think of would be to choose one or more cutoffs, as I did with the rent burden variable (for example, categorize income into "high income" and "low income"). I didn't do this because I wasn't sure what those cutoffs would be. I'm also not really sure how much it matters, since either way we can't capture the real high- and low-end incomes.
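The two-stage fallback described above (use the block group value, else the tract mean, else the ward mean) can be sketched on a toy vector in base R, independent of the dplyr implementation below; all numbers here are invented:

```r
# Toy data: 6 block groups in 3 tracts; tract t3 has no data at all
val   <- c(10, NA, 20, 30, NA, NA)
tract <- c("t1", "t1", "t2", "t2", "t3", "t3")
ward  <- c("w1", "w1", "w2", "w2", "w2", "w2")

# Stage 1: fill missing block groups with their tract mean
tract_mean <- ave(val, tract, FUN = function(x) mean(x, na.rm = TRUE))
val_imp <- ifelse(is.na(val), tract_mean, val)

# Stage 2: tracts with no data (mean of all-NA is NaN) fall back to the ward mean
ward_mean <- ave(val_imp, ward, FUN = function(x) mean(x[is.finite(x)]))
val_imp <- ifelse(is.nan(val_imp), ward_mean, val_imp)
val_imp  # c(10, 10, 20, 30, 25, 25)
```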

Imputation part 1: replace missing data points with tract-level data.

# Create tract dataframe
tract_data <- acs_data |>
  mutate(med_hhi = if_else(
    med_hhi < 0, NA, med_hhi),
    med_prop_val = if_else(
      med_prop_val < 0, NA, med_prop_val)
  ) |>
  group_by(tract) |>
  summarize(
    rb_tract = mean(rent_burden_pct, na.rm = TRUE),
    hhi_tract = mean(med_hhi, na.rm = TRUE),
    propval_tract = mean(med_prop_val, na.rm = TRUE)
  )


# Replace missing data with tract data:
  
acs_data <- acs_data |>
  left_join(tract_data, by = "tract") |>
  # Rent burden
  mutate(
    # Create indicator for imputed vars
    rb_imputed = if_else(
      is.na(rent_burden_pct), 1, 0),
    # Tract-level imputation
    rent_burden_pct = if_else(
      is.na(rent_burden_pct), rb_tract, rent_burden_pct)
  ) |>
  # Median HHI
  mutate(
    # Create indicator for imputed vars
    hhi_imputed = if_else(
      med_hhi < 0, 1, 0),
    # Tract-level imputation
    med_hhi = if_else(
    med_hhi < 0, hhi_tract, med_hhi)
  ) |>
  # Median property value
  mutate(
    # Create indicator for imputed vars
    propval_imputed = if_else(
      med_prop_val < 0, 1, 0),
    # Tract-level imputation
    med_prop_val = if_else(
    med_prop_val < 0, propval_tract, med_prop_val)
  )

Creating the Final Dataset for Analysis

Deciding threshold for hotspot vs. not hotspot

Deciding the threshold for hot vs. not hot:

# Visualizing a cumulative histogram of counts of UHI values
landcover |>
  ggplot(aes(uhi)) +
  geom_histogram(aes(y = after_stat(cumsum(count))))

# Calculating frequency & cumulative frequency
heat_freq <- landcover |>
  count(uhi)

heat_freq$cumulative <- cumsum(heat_freq$n)

# Exploring other ways to generate a cutoff
heat_freq |> summarize(
  sd(uhi),
  mean(uhi)
)
Simple feature collection with 1 feature and 2 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 389616.6 ymin: 124877.9 xmax: 407860.4 ymax: 147545.8
Projected CRS: NAD83 / Maryland
# A tibble: 1 × 3
  `sd(uhi)` `mean(uhi)`                                                 geometry
      <dbl>       <dbl>                                            <POLYGON [m]>
1      1.80        89.8 ((394387 136007.6, 394387.7 136010.7, 394388 136012.3, …
# The standard deviation of UHI is 1.80
# The mean is 89.8
  # 1 SD above mean would be 91.6
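Restating the cutoff arithmetic from the comments above:

```r
# Mean and standard deviation of UHI across block groups, from the summary above
uhi_mean <- 89.8
uhi_sd   <- 1.80

# Hotspot threshold: one standard deviation above the mean
uhi_mean + uhi_sd  # 91.6
```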

heat_freq |>
  filter(uhi >= 91.6) |>
  count()
Simple feature collection with 1 feature and 1 field
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 395375 ymin: 134605.6 xmax: 401893.4 ymax: 143306.4
Projected CRS: NAD83 / Maryland
# A tibble: 1 × 2
      n                                                                 geometry
* <int>                                                       <MULTIPOLYGON [m]>
1    82 (((395651.3 136857, 395651.4 136859.2, 395657.1 136859.3, 395664.3 1368…
# 82 block groups have a UHI value at least one SD above the mean.

STEPHANIE 12/15 - I think doing one standard deviation makes sense. It didn't adversely affect the decision tree model at least, and probably makes more sense than just choosing the midpoint.

MADELEINE 12/14: I split up the join from the metrics because the ward-level imputation has to happen in between (after join, before metrics).

dc_full_data <- left_join(landcover, acs_data, by = "block_group") 

Data Imputation Part 2

Imputation part 2: for any data points still missing after imputing at tract level, impute at ward level.

# Create ward-level dataframe
ward_data <- dc_full_data |>
  group_by(ward) |>
  summarize(
    rb_ward = mean(rent_burden_pct, na.rm = TRUE),
    hhi_ward = mean(med_hhi, na.rm = TRUE),
    propval_ward = mean(med_prop_val, na.rm = TRUE)
  ) |>
  st_drop_geometry()

# Replace missing data with ward data:
# Rent burden
dc_full_data <- dc_full_data |>
  left_join(ward_data, by = "ward") |>
  mutate(
    rent_burden_pct = if_else(
      is.na(rb_tract), rb_ward, rent_burden_pct),
    med_hhi = if_else(
      is.na(hhi_tract), hhi_ward, med_hhi),
    med_prop_val = if_else(
      is.na(propval_tract), propval_ward, med_prop_val)
    )
  
# Drop columns no longer needed
dc_full_data <- dc_full_data |>
  select(-c(rb_tract, rb_ward,
            hhi_tract, hhi_ward,
            propval_tract, propval_ward))

Finalizing full dataset

dc_full_data <- dc_full_data|>
  mutate(
    # convert total acres to sq miles
    total_sq_mi = total_ac * 0.0015625,
    # convert water acres to sq miles
    water_sq_mi = wat_ac * 0.0015625,
    # calculate water share of total sq miles
    share_water_sq_mi = water_sq_mi / total_sq_mi,
    # calculate population density per square mile
    pop_dens_sq_mi = total_pop / total_sq_mi,
    # calculate population density per water sq mile
    pop_density_water_sq_mi = ifelse(
      water_sq_mi !=0, total_pop / water_sq_mi, 0),
    # calculate ratio of housing units to population
    hous_units_per_person = ifelse(
      total_pop != 0, total_hous_units / total_pop, 0),
    # calculate ratio of vacant units to all units
    vac_units_share = ifelse(
      total_hous_units != 0, vac_units / total_hous_units, 0),
    # calculate ratio of total income to median home value
    inc_home_val_ratio = med_hhi / med_prop_val,
    # create combined HS diploma or equivalent variable
    ed_comb_hsd = ed_reg_hsd + ed_ged,
    # create combined advanced post-undergraduate variable
    ed_comb_adv_deg = ed_doct + ed_prof_deg + ed_master,
    # calculate ratio of 25+ pop with HSD to total pop
    ed_hsd_ratio = ifelse(
      ed_comb_hsd !=0, ed_comb_hsd / total_pop, 0),
    # calculate ratio of 25+ pop with bachelor's degree to total pop
    ed_bach_ratio = ifelse(
      ed_bach != 0, ed_bach / total_pop, 0),
    # calculate ratio of 25+ pop with advanced post-bach degree to total pop
    ed_adv_ratio = ifelse(
      ed_comb_adv_deg != 0, ed_comb_adv_deg / total_pop, 0),
    # calculate pop share white
    race_white_ratio = ifelse(
      race_white != 0, race_white / total_pop, 0),
    # calculate pop share black or african american
    race_black_ratio = ifelse(
      race_black != 0, race_black / total_pop, 0),
    # calculate pop share american indian or alaska native
    race_ai_an_ratio = ifelse(
      race_ai_an!= 0, race_ai_an / total_pop, 0),
    # calculate pop share asian
    race_asian_ratio = ifelse(
      race_asian != 0, race_asian / total_pop, 0),
    # calculate pop share native hawaiian or pacific islander
    race_nh_pi_ratio = ifelse(
      race_nh_pi != 0, race_nh_pi / total_pop, 0),
    # calculate pop share two or more races
    race_oth_mult_ratio = ifelse(
      race_oth_mult != 0, race_oth_mult / total_pop, 0),
    # calculate pop share hispanic or latino
    race_eth_hisp_ratio = ifelse(
      race_eth_hisp != 0, race_eth_hisp / total_pop, 0),
    # calculate total non-white population
    race_poc = race_black + race_ai_an + race_asian + race_nh_pi + race_oth_mult,
    # calculate pop share non-white
    race_poc_ratio = ifelse(
      race_poc != 0, race_poc / total_pop, 0),
    # creating hotspot var for decision tree model
    hotspot = if_else(
      uhi >= 91.6, "hotspot", "not hotspot"),
    # remove factor from uhi var
    uhi = as.numeric(as.character(uhi))) |>
  relocate(c("block_group_temp", "tract"), .after = 2) |>
  # remove columns with only one unique value
  select(-objectid,
         -name,
         -state,
         -statefp,
         -county,
         -countyfp,
         -tractce,
         -blkgrpce,
         -namelsad,
         -mtfcc,
         -funcstat,
         -field,
         -shapearea,
         -shapelen) |>
  relocate(race_white_ratio:race_eth_hisp_ratio,
           .after = race_eth_hisp) |>
  relocate(ed_comb_hsd:ed_adv_ratio,
           .after = ed_ged) |>
  relocate(hous_units_per_person:inc_home_val_ratio,
           .after = med_prop_val) |>
  relocate(total_sq_mi:pop_density_water_sq_mi,
           .after = inc_home_val_ratio) |>
  relocate(total_pop:hotspot, .after = tract)

Evaluating imputation via visualization

Missing data have been imputed using the next-smallest geography available. To assess the accuracy of these imputations, we plot each variable with imputed values against a variable with no imputed values that should be correlated with it. We can then see whether the imputed values follow the general trend of the correlation.

#### PERCENT RENT BURDENED ####

# It was harder finding variables that are clearly correlated with this than it 
# was for some of the other partially-imputed variables. These are the most
# useful results I got.

# % rent burdened vs. bachelor's degree ratio
dc_full_data |> ggplot() +
  geom_point(mapping =
    aes(x = ed_bach_ratio, y = rent_burden_pct,
        color = as.character(rb_imputed))
  ) +
  labs(color = "imputed (0 = no, 1 = yes)")

# Not a super strong trend here, but all imputed data points seem to fit
# the general shape of the data.

# % rent burdened vs. ward
dc_full_data |> ggplot() +
  geom_point(mapping =
    aes(x = ward, y = rent_burden_pct,
        color = as.character(rb_imputed))
  ) +
  labs(color = "imputed (0 = no, 1 = yes)")

# All the outliers here (rent_burden_pct = 1 or = 0) are not imputed values.
# Wouldn't want to rely on this alone since we impute using ward-level means
# in a few cases, but alongside the other charts, it provides further reassurance.

#### MEDIAN HOUSEHOLD INCOME ####

# Median household income vs. POC ratio
dc_full_data |> ggplot() +
  geom_point(mapping =
    aes(x = race_poc_ratio, y = med_hhi,
        color = as.character(hhi_imputed))
  ) +
  labs(color = "imputed (0 = no, 1 = yes)")

# This looks pretty promising to me in terms of the imputed values being reasonable
# One or two semi-outliers, but imputed values generally follow trend of other data
# Don't think we need to worry too much about values where x = 0;
# these likely had no population data or actually no population

# Median household income vs. bachelor's degree ratio
dc_full_data |> ggplot() +
  geom_point(mapping =
    aes(x = ed_bach_ratio, y = med_hhi,
        color = as.character(hhi_imputed))
  ) +
  labs(color = "imputed (0 = no, 1 = yes)")

# The outlier here is really just an outlier in terms of ed_bach_ratio.
# Arguably the imputed med_hhi should have been higher, but I'm not sure how
# much of an issue this is considering it's only one data point.

# Median household income vs. ward
dc_full_data |> ggplot() +
  geom_point(mapping =
    aes(x = ward, y = med_hhi,
        color = as.character(hhi_imputed))
  ) +
  labs(color = "imputed (0 = no, 1 = yes)")

# This looks good to me too. Really no imputed values are outliers in their ward.


#### MEDIAN PROPERTY VALUE ####

# Median property value vs. POC ratio
dc_full_data |> ggplot() +
  geom_point(mapping =
    aes(x = race_poc_ratio, y = med_prop_val,
        color = as.character(propval_imputed))
  ) +
  labs(color = "imputed (0 = no, 1 = yes)")

# Imputed values seem to fit trend of data quite well.

# Median property value vs. bachelor's degree ratio
dc_full_data |> ggplot() +
  geom_point(mapping =
    aes(x = ed_bach_ratio, y = med_prop_val,
        color = as.character(propval_imputed))
  ) +
  labs(color = "imputed (0 = no, 1 = yes)")

# Slightly weirdly shaped data, but the imputed points do not seem to be
# outside the trend. Only outlier is on the ed_bach_ratio var.

# Median property value vs. ward
dc_full_data |> ggplot() +
  geom_point(mapping =
    aes(x = ward, y = med_prop_val,
        color = as.character(propval_imputed))
  ) +
  labs(color = "imputed (0 = no, 1 = yes)")

# No imputed values are outliers in their ward.

# Remove indicator variables for imputation
dc_full_data <- dc_full_data |>
  select(-rb_imputed,
         -hhi_imputed,
         -propval_imputed)

Prepare the data for analysis (set seed & splitting)

JAMIE 12/12 - Changed the var names so that these are the splits we use for all models, not just decision tree

  • Do we need an implementation set too?

STEPHANIE 12/13 - Aaron mentioned that when we split the data, we should stratify by some geographic feature so that, for example, not all the block groups in a certain ward end up only in the training data. If they did, our model would only really work for that ward. Thus the argument strata = ward.

  • Concerning implementation data, I worry that since we only have a set of n = 571, taking out a number of observations would significantly affect our model. Maybe we can just hold out a set of 10 data points for the implementation data like we did in Assignment 7, but we'd have to randomize which ten are taken out (in Assignment 7 we didn't randomize; we sliced the data by where observations were located in the dataset).

# Set seed
set.seed(20241202)

# Split sample and create training and testing datasets
dc_split <- initial_split(data = dc_full_data, strata = ward, prop = 0.8)
dc_train <- training(x = dc_split) 
dc_test <- testing(x = dc_split)
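To illustrate why strata = ward matters, here is a toy example in base R (the ward sizes are invented, not the real data): sampling 80% within each ward keeps every ward's share of block groups intact in the training set, which a plain random sample does not guarantee for small wards.

```r
# Toy block-group-to-ward assignment with invented ward sizes summing to 571
set.seed(20241202)
ward <- rep(1:8, times = c(120, 100, 90, 80, 70, 60, 31, 20))

# Stratified split: sample 80% of block groups within each ward
train_idx <- unlist(lapply(split(seq_along(ward), ward),
                           function(i) sample(i, size = floor(0.8 * length(i)))))

# Every ward keeps roughly 80% of its block groups in training
prop_by_ward <- tapply(seq_along(ward) %in% train_idx, ward, mean)
round(prop_by_ward, 2)
```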

Initial Exploratory Data Analysis

Review summary statistics

JAMIE 12/12 - Added summary stats table here

# Review summary statistics
# Review summary statistics
dc_train_summary <- dc_train |>
  select(where(is.numeric)) |>
  lapply(summary) |>
  lapply(`length<-`, 6) |>
  as_tibble() |>
  mutate(measure = c("Min", "Q1", "Median", "Mean", "Q3", "Max")) |>
  mutate(across(everything(), as.vector)) |>
  pivot_longer(-measure) |>
  pivot_wider(id_cols = "name",
              names_from = "measure",
              values_from = "value")

kable(dc_train_summary)
name Min Q1 Median Mean Q3 Max
total_pop 0.0000000 850.5000000 1139.0000000 1221.2087912 1555.5000000 3742.0000000
race_white 0.0000000 71.5000000 439.0000000 494.4307692 795.5000000 2335.0000000
race_black 0.0000000 87.0000000 348.0000000 557.2703297 871.5000000 3661.0000000
race_ai_an 0.0000000 0.0000000 0.0000000 4.4747253 0.0000000 508.0000000
race_asian 0.0000000 0.0000000 21.0000000 49.1758242 60.5000000 773.0000000
race_nh_pi 0.0000000 0.0000000 0.0000000 0.6109890 0.0000000 71.0000000
race_oth_mult 0.0000000 6.5000000 30.0000000 52.1736264 72.5000000 487.0000000
race_eth_hisp 0.0000000 22.0000000 78.0000000 141.8263736 176.5000000 1451.0000000
race_white_ratio 0.0000000 0.0682352 0.4009009 0.4222330 0.7418562 1.0000000
race_black_ratio 0.0000000 0.0875971 0.3272624 0.4171346 0.7453525 1.0000000
race_ai_an_ratio 0.0000000 0.0000000 0.0000000 0.0032669 0.0000000 0.3793876
race_asian_ratio 0.0000000 0.0000000 0.0184049 0.0401567 0.0522593 0.7630800
race_nh_pi_ratio 0.0000000 0.0000000 0.0000000 0.0005430 0.0000000 0.0817536
race_oth_mult_ratio 0.0000000 0.0048783 0.0263158 0.0427554 0.0570762 0.3453901
race_eth_hisp_ratio 0.0000000 0.0217248 0.0674157 0.1061547 0.1591406 0.6898551
ed_doct 0.0000000 0.0000000 19.0000000 35.4373626 54.5000000 291.0000000
ed_prof_deg 0.0000000 9.0000000 46.0000000 76.7472527 121.0000000 465.0000000
ed_master 0.0000000 65.5000000 156.0000000 183.4659341 277.0000000 840.0000000
ed_bach 0.0000000 100.0000000 189.0000000 218.6505495 308.5000000 798.0000000
ed_assoc 0.0000000 0.0000000 14.0000000 26.7780220 40.0000000 413.0000000
ed_reg_hsd 0.0000000 12.0000000 74.0000000 123.2065934 184.5000000 907.0000000
ed_ged 0.0000000 0.0000000 4.0000000 23.2065934 27.0000000 498.0000000
ed_comb_hsd 0.0000000 18.5000000 88.0000000 146.4131868 219.0000000 1133.0000000
ed_comb_adv_deg 0.0000000 92.5000000 260.0000000 295.6505495 461.0000000 1094.0000000
ed_hsd_ratio 0.0000000 0.0169049 0.0785219 0.1084304 0.1757625 0.5833980
ed_bach_ratio 0.0000000 0.0899638 0.1766304 0.1888757 0.2700129 1.0000000
ed_adv_ratio 0.0000000 0.0844247 0.2486931 0.2575276 0.4085868 1.0000000
med_hhi 2499.0000000 60766.5000000 96354.0000000 104099.6783699 135677.0000000 250001.0000000
total_hous_units 0.0000000 381.5000000 525.0000000 554.1890110 717.5000000 1560.0000000
occ_units 0.0000000 349.0000000 469.0000000 500.9538462 643.5000000 1482.0000000
vac_units 0.0000000 13.5000000 42.0000000 53.2351648 80.5000000 368.0000000
med_prop_val 146700.0000000 407616.6666667 597900.0000000 637116.7003945 788800.0000000 2000001.0000000
hous_units_per_person 0.0000000 0.3621221 0.4486621 0.4728340 0.5845168 1.3872340
vac_units_share 0.0000000 0.0255810 0.0763501 0.0919955 0.1296251 0.8938053
inc_home_val_ratio 0.0055447 0.1164494 0.1602061 0.1707757 0.2154070 0.4492911
total_sq_mi 0.0065000 0.0373203 0.0621406 0.1194089 0.1161562 4.4441719
water_sq_mi 0.0000000 0.0000000 0.0000000 0.0124242 0.0000000 1.9424688
share_water_sq_mi 0.0000000 0.0000000 0.0000000 0.0161898 0.0000000 0.6663825
pop_dens_sq_mi 0.0000000 8937.9885706 18233.3333333 23744.1005885 30555.1000276 167846.1538462
pop_density_water_sq_mi 0.0000000 0.0000000 0.0000000 934308.6588760 0.0000000 88576000.0000000
rent_burden_pct 0.0000000 0.2792525 0.4251969 0.4166570 0.5592122 1.0000000
race_poc 0.0000000 208.0000000 500.0000000 663.7054945 967.5000000 3680.0000000
race_poc_ratio 0.0000000 0.2168948 0.4611239 0.5038566 0.8234325 1.0000000
aland 16848.0000000 96644.5000000 159891.0000000 276436.3274725 294718.0000000 6514228.0000000
awater 0.0000000 0.0000000 0.0000000 32846.4923077 0.0000000 4996439.0000000
gid 3.0000000 154.5000000 293.0000000 289.4483516 427.5000000 571.0000000
total_ac 4.1600000 23.8850000 39.7700000 76.4216923 74.3400000 2844.2700000
land_ac 4.1600000 23.8850000 39.5100000 68.4702418 73.2150000 1601.0900000
tcan_ac 0.1200000 4.8400000 9.9100000 25.5966593 24.2900000 647.3400000
tcan_pct 0.7400000 17.7600000 25.6500000 28.4931209 36.0150000 87.9600000
veg_ac 0.0200000 1.5500000 4.6600000 11.5154066 12.3500000 541.3000000
veg_pct 0.3000000 5.9150000 11.0300000 12.1633407 17.2700000 43.9300000
bld_ac 0.8700000 5.9200000 8.3300000 10.8100000 11.9700000 147.0600000
bld_pct 2.2700000 14.4550000 20.0600000 22.3882637 30.7750000 58.9000000
to_ia_ac 3.1600000 15.2350000 22.5300000 31.2712967 35.1900000 608.0700000
to_ia_pct 6.2500000 42.6600000 57.8700000 57.6620440 74.3250000 97.0800000
road_ac 0.6800000 3.6900000 5.6700000 8.1374286 9.4200000 207.7900000
road_pct 2.0600000 10.5000000 14.3200000 14.4510110 17.8550000 37.6900000
plot_ac 0.0000000 0.5450000 1.4200000 3.2879121 3.0250000 109.0900000
plot_pct 0.0000000 1.6700000 3.2900000 4.7149451 6.5850000 31.6400000
swalk_ac 0.2100000 1.1250000 1.7600000 2.7635385 2.9100000 140.0800000
swalk_pct 0.1900000 3.0100000 4.3200000 5.1728791 6.4500000 18.7100000
ot_ia_ac 0.1200000 2.6100000 4.4700000 6.2724396 7.3800000 82.4500000
ot_ia_pct 1.0500000 6.8100000 10.5900000 10.9345275 14.1700000 36.6800000
wat_ac 0.0000000 0.0000000 0.0000000 7.9514945 0.0000000 1243.1800000
wat_pct 0.0000000 0.0000000 0.0000000 1.6189734 0.0000000 66.6383791
soil_ac 0.0000000 0.0000000 0.0000000 0.0868791 0.0100000 7.9100000
soil_pct 0.0000000 0.0000000 0.0000000 0.0626154 0.0300000 2.6000000
utc_ac 0.1200000 4.8400000 9.9100000 25.5966593 24.2900000 647.3400000
utc_pct 0.7400000 18.1100000 25.7700000 28.9163297 36.4900000 88.3300000
ppa_v_ac 0.0100000 1.3450000 4.1400000 10.2947253 11.7350000 262.4600000
ppa_v_pct 0.2100000 5.1750000 10.1600000 11.6471648 17.0750000 43.9000000
to_ppa_ac 0.5600000 3.9300000 8.1300000 16.2005275 17.8450000 404.5100000
to_ppa_pct 5.7100000 14.3600000 20.2400000 21.6425714 27.5100000 57.7600000
ppa_ia_ac 0.2700000 1.7900000 3.3000000 5.9056264 5.9950000 142.0500000
ppa_ia_pct 0.4300000 5.4900000 8.9000000 9.9955604 13.4500000 35.9400000
un_v_ac 0.0000000 0.0000000 0.0300000 1.1835385 0.4800000 277.3300000
un_v_pct 0.0000000 0.0000000 0.0900000 0.7950549 0.6650000 17.3200000
to_un_ac 1.9300000 12.8500000 19.4900000 26.6733626 30.1100000 751.2900000
to_un_pct 5.8400000 36.3800000 49.5500000 49.4410769 64.0750000 90.9600000
un_ia_ac 1.9200000 12.7650000 19.4500000 25.4088132 29.1600000 467.5400000
un_ia_pct 5.5900000 34.9250000 47.5100000 48.5798022 63.2750000 86.4700000
un_sl_ac 0.0000000 0.0000000 0.0000000 0.0808571 0.0100000 6.4300000
un_sl_pct 0.0000000 0.0000000 0.0000000 0.0659780 0.0300000 3.6300000
wtr_ac 0.0000000 0.0000000 0.0000000 7.9514945 0.0000000 1243.1800000
wtr_pct 0.0000000 0.0000000 0.0000000 2.7711429 0.0000000 199.7500000
utc_ac_06 0.1200000 4.1550000 9.1100000 24.7237802 23.1050000 645.8900000
utc_pct_06 0.3500583 15.4988558 23.5890767 27.0868853 34.7122435 87.4272365
utcac0620 -26.2600000 -0.2100000 0.5900000 0.8728791 1.8050000 44.5900000
utcpct0620 -20.8511393 -0.6573329 2.0519734 1.8294444 4.5201446 19.9808700
utc_ac_11 0.1400000 4.7800000 9.3100000 25.3473626 22.8700000 648.8500000
utc_pct_11 2.1336761 17.0570509 26.0317460 28.5631563 36.5599560 88.5914522
utcac1120 -44.1500000 -0.8250000 0.0600000 0.2492967 1.1650000 50.5300000
utcpct1120 -21.2878206 -2.0987933 0.1500000 0.3531734 2.6685408 14.4022283
utc_ac_15 0.1300000 4.8000000 10.4100000 26.4955824 24.2500000 656.6800000
utc_pct_15 2.4164524 16.5306141 26.2942431 29.5338941 37.5922255 89.9739583
utcac1520 -43.0600000 -1.1300000 -0.1900000 -0.8989231 0.3400000 7.4900000
utcpct1520 -13.2604114 -2.4741031 -0.5030109 -0.6175644 1.3392326 6.8620961
enrgy_con 0.0000000 0.8132393 2.6790648 4.7613427 7.2676798 35.1203240
strm_wtr 0.5867829 5.5386833 9.0813977 10.0328023 13.4333690 35.9528898
uhi 82.7737511 88.5480260 90.0977402 89.7619244 91.0952792 93.4873438
land_own 0.0000000 0.0011507 0.3909353 2.5219515 2.5708216 50.9486314
wldlf_con 0.0000000 0.0000000 0.0695619 4.0515448 7.2214827 20.8969048
overall 5.2997772 6.9158712 7.1840150 7.2340716 7.5190972 9.4915959
low_utc 11.6700000 63.5100000 74.2300000 71.0836703 81.8900000 99.2600000
pos_utc 5.7100000 14.3600000 20.2400000 21.6425714 27.5100000 57.7600000
ward 1.0000000 3.0000000 5.0000000 4.5516484 6.0000000 8.0000000
geometry 455.0000000 0.0000000 0.0000000 NA NA NA

Exploration of some of the continuous variables

JAMIE 12/12 - When we run the visualization below, we get a warning message that 56 rows were removed.

  • I silenced the total ed attainment ggplot per the note above.

STEPHANIE 12/13 - Concerning the rows being removed, I think that’s fine; that has happened to me with other visualizations before.

  • Concerning the ed attainment, I’m getting an error that “educ_total” doesn’t exist, so I’m not sure what’s happening here.

JAMIE 12/13 - I removed the variable that went into educ_total because it’s just the total line in the education table. So if we want a data viz here, we need to use the categorical variables.

# Median household income, continuous
dc_train |>
  ggplot() +
  geom_histogram(aes(x = as.numeric(med_hhi))
  )

# Total educational attainment
#dc_train |>
#  ggplot() +
#  geom_histogram(aes(x = as.numeric(educ_total))
#  )
# Still not sure how to interpret this one. Might be worth going back to the categories

Exploring temperature in relation to other variables

JAMIE 12/12 - I fixed this. We have to put the sf data frame in the first position of left_join(), so that we join the non-sf data frame onto the sf one and the result stays sf. BUT do we have to st_transform() since the CRS is not 4326?

STEPHANIE 12/13 - It doesn’t matter that the CRS isn’t 4326 for mapping purposes; the main concern with CRS is that it has to match between two spatial datasets before you do something like join them. I’m also realizing that we probably shouldn’t map dc_train here: the map would reveal which block groups are in the training data versus the testing data, since you can see which block groups are excluded, and this may be a form of data leakage. I’m going to use the dc_full_data dataset instead.

dc_full_data |> ggplot() +
  geom_sf(
    mapping = aes(fill = uhi),
    color = "#400001"
    ) +
    scale_fill_gradient(
    low = "#ffffff",
    high = "#be2b25"
    ) +
  labs(
    title = "Average evening temperature, 2018",
    fill = "Temperature (degrees F)"
  ) +
  theme_void()

Temperature & population:

# Total population
dc_train |>
  ggplot(aes(x = total_pop, y = uhi, color = hotspot)) +
  geom_point(alpha = 0.5) +
  theme_minimal()

# Population density
dc_train |>
  ggplot(aes(x = pop_dens_sq_mi, y = uhi, color = hotspot)) +
  geom_point(alpha = 0.5) +
  theme_minimal()

# Non-white population density
dc_train |>
  ggplot(aes(x = race_poc_ratio, y = uhi, color = hotspot)) +
  geom_point(alpha = 0.5) +
  theme_minimal()

Temperature & income/education:

# Median household income
dc_train |>
  ggplot(aes(x = med_hhi, y = uhi, color = hotspot)) +
  geom_point(alpha = 0.5) +
  theme_minimal()

# Ratio of bachelors degrees
dc_train |>
  ggplot(aes(x = ed_bach_ratio, y = uhi, color = hotspot)) +
  geom_point(alpha = 0.5) +
  theme_minimal()

Temperature & landcover:

# Comparing with percentage of non-canopy vegetation
dc_train |>
  ggplot(aes(x = veg_pct, y = uhi, color = hotspot)) +
  geom_point(alpha = 0.5) +
  theme_minimal()

# Comparing with percentage area of urban tree canopy
dc_train |>
  ggplot(aes(x = utc_pct, y = uhi, color = hotspot)) +
  geom_point(alpha = 0.5) +
  theme_minimal()

# Comparing with percentage area of impervious surfaces (e.g., roads, parking
# lots, etc.)
dc_train |>
  ggplot(aes(x = to_ia_pct, y = uhi, color = hotspot)) +
  geom_point(alpha = 0.5) +
  theme_minimal()

STEPHANIE 12/13 - The graphs comparing UHI with the percentage area of urban tree canopy and of impervious surfaces seem quite compelling. This makes sense, and it’s interesting that the demographic variables don’t show up much in EDA while the landcover variables look like promising predictors.

Preparation for Modeling

STEPHANIE 12/13 - I added “-tract” and “-ward” to remove more geographic indicators that would likely affect the model, since those otherwise would’ve been main predictors of a hotspot.

JAMIE 12/12 - Do we want folds? Added here. Not using folds does not make the lm rank-deficient model warning go away.

STEPHANIE 12/13 - I think yes, we’ll have to use them, especially for hyperparameter tuning. My main issue with folding at this point is that we need to remove all the variables we don’t want from dc_train first and then do the folding, so that the folding process doesn’t take them into account. So I think we should take the variables out of dc_train before the lm and rf models (using select rather than step_rm) and then do the folding.

JAMIE 12/13 - Relocated these steps per discussion.

MADELEINE 12/14 - should hotspot be kept in here? there is a comment saying “remove duplicative hotspot”, but then the next code chunk seems to need it

STEPHANIE 12/15 - that’s a good point, Madeleine; we’ll have to remove it for lm and rf but need to keep it in for decision tree

# Prep the dataset for modeling
dc_train <- dc_train |>
  st_drop_geometry() |>
  select(
    # remove identifying vars
    -block_group,
    -intptlat,
    -intptlon,
    -gis_id,
    -globalid,
    -gid,
    -tract,
    -ward,
    -block_group_temp,
    # remove raw total pop var (custom pop density and other pop proportion
    # vars remain)
    -total_pop,
    # remove raw race vars (custom proportion vars remain)
    -race_white,
    -race_black,
    -race_ai_an,
    -race_asian,
    -race_nh_pi,
    -race_oth_mult,
    -race_eth_hisp,
    # remove raw ed vars (custom proportion vars remain)
    -ed_doct,
    -ed_prof_deg,
    -ed_master,
    -ed_bach,
    -ed_assoc,
    -ed_reg_hsd,
    -ed_ged,
    -ed_comb_hsd,
    -ed_comb_adv_deg,
    # remove duplicative housing data (custom proportion vars and pct var
    # remain)
    -total_hous_units,
    -occ_units,
    -vac_units,
    # remove duplicative land data (custom proportion / "pct" vars remain)
    -ends_with("_ac"),
    -total_sq_mi,
    -water_sq_mi,
    -wtr_pct,
    -aland,
    -awater,
    -utcac0620,
    -utcac1120,
    -utcac1520,
    -utcpct0620,
    -utcpct1120,
    -utc_pct_06,
    -utc_pct_11,
    -utc_pct_15,
    # remove pop_density_water_sq_mi due to INF value for most records 
    -pop_density_water_sq_mi) 

# V-fold cross validation with 10 folds
dc_folds <- vfold_cv(data = dc_train, v = 10)
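As a quick illustration of what vfold_cv() does (a base-R sketch, not tidymodels’ internals; the row count n = 450 is assumed for illustration):

```r
# Sketch of 10-fold cross validation: every row is assigned to exactly one
# fold; each fold serves once as the assessment set while the other nine
# folds train the model.
set.seed(20221215)
n <- 450
fold_id <- sample(rep(1:10, length.out = n))

table(fold_id)            # each fold holds 45 rows
length(unique(fold_id))   # 10 folds
```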

Decision Tree

Modeling

dt_recipe <- recipe(formula = hotspot ~ ., data = dc_train) |>
  # remove uhi var to avoid duplication of data from hotspot
  step_rm(uhi) |>
  themis::step_downsample(hotspot) |> 
  step_nzv(all_numeric_predictors())

dt_model <- decision_tree() |>
  set_engine(engine = "rpart") |>
  set_mode(mode = "classification")

dt_workflow <- workflow() |>
  add_recipe(dt_recipe) |>
  add_model(dt_model)
  
dt_fitting <- dt_workflow |>
  fit(data = dc_train)

rpart.plot::rpart.plot(x = dt_fitting$fit$fit$fit,
                       roundint = FALSE)

STEPHANIE 12/15 - Retaining comments for write-up in case we want to utilize them:

Note on results: I think we have an interesting model. The first decision point is the percentage of the block group covered by tree canopy; if it’s greater than or equal to 23%, it’s not a hotspot. If it’s less than 23%, then you check how suitable the block group is for tree planting “based on a weighted formula that includes all planting prioritization categories.” If the score is less than 7, it’s not a hotspot. If it’s greater than or equal to 7, you check the percent of total area covered by all combined impervious surfaces. If that percentage is less than 66%, it’s not a hotspot. If it’s greater than or equal to 66%, the final check is whether the number of housing units per person is less than 0.74. If it’s greater than or equal to 0.74, it’s not a hotspot; if it’s less than 0.74, it is a hotspot.
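The chain of rules above can be traced directly for a hypothetical block group (all values below are assumed for illustration, not taken from the data):

```r
# Hypothetical block group: sparse canopy, high planting suitability,
# mostly impervious, few housing units per person
utc_pct <- 15
overall <- 7.5
to_ia_pct <- 70
hous_units_per_person <- 0.5

is_hotspot <- (utc_pct < 23) &&
  (overall >= 7) &&
  (to_ia_pct >= 66) &&
  (hous_units_per_person < 0.74)

is_hotspot  # TRUE: every split sends this block group down the hotspot branch
```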

Note on methods: STEPHANIE 12/13 - I’ve decided to keep in downsampling because I think it’s useful. For reference, down-sampling “randomly subset all the classes in the training set so that their class frequencies match the least prevalent class. For example, suppose that 80% of the training set samples are the first class and the remaining 20% are in the second class. Down-sampling would randomly sample the first class to be the same size as the second class (so that only 40% of the total training set is used to fit the model).”
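That 80/20 example can be sketched in base R (illustrative only, not the themis implementation):

```r
# 80 majority-class rows, 20 minority-class rows
set.seed(42)
labels <- c(rep("not hotspot", 80), rep("hotspot", 20))

# down-sample the majority class to the minority class size
minority_n  <- min(table(labels))
keep_major  <- sample(which(labels == "not hotspot"), size = minority_n)
downsampled <- labels[sort(c(keep_major, which(labels == "hotspot")))]

table(downsampled)                     # 20 hotspot, 20 not hotspot
length(downsampled) / length(labels)   # 0.4: only 40% of the data is kept
```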

Visualizing Important Variables

STEPHANIE 12/13 - Using dc_full_data as the dataset since mapping dc_train could be a form of data leakage: we’d notice which block groups were excluded (i.e., which are in the testing data).

canopy_map <- dc_full_data |> ggplot() +
  geom_sf(aes(fill = utc_pct)) +
    scale_fill_gradient(
    low = "#fcbc32",
    high = "#4e974b"
    ) +
  theme_void() +
  labs(
    title = "Percentage of tree canopy",
    subtitle = "by Block Group",
    fill = "Percent Area",
    caption = "Open Data DC (2020)"
  )

planting_map <- dc_full_data |> ggplot() +
  geom_sf(aes(fill = overall)) +
    scale_fill_gradient(
    low = "#e2e2e2",
    high = "#16501a"
    ) +
  theme_void() +
  labs(
    title = "Tree planting suitability score",
    subtitle = "by Block Group",
    fill = "Score",
    caption = "Open Data DC (2020)"
  )

impervious_map <- dc_full_data |> ggplot() +
  geom_sf(aes(fill = to_ia_pct)) +
    scale_fill_gradient(
    low = "#ffffff",
    high = "#28292b"
    ) +
  theme_void() +
  labs(
    title = "Percentage covered by impervious surfaces",
    subtitle = "by Block Group",
    fill = "Percent Area",
    caption = "Open Data DC (2020)"
  )

housing_map <- dc_full_data |> ggplot() +
  geom_sf(aes(fill = hous_units_per_person)) +
    scale_fill_gradient(
    low = "#d86400",
    high = "#0085d4"
    ) +
  theme_void() +
  labs(
    title = "Housing units per person",
    subtitle = "by Block Group",
    fill = "# Housing Units",
    caption = "ACS 2016-2020"
  )

canopy_map + planting_map + impervious_map + housing_map

Evaluating the Model

STEPHANIE 12/15 - Adding this evaluation in since we need to assess the decision tree model on its own (can’t really compare it to rf and lm since they have different outcome variables).

Creating predictions

# Ensuring the testing dataset has the same variables as the training dataset
dc_test <- dc_test |>
  st_drop_geometry() |>
  select(
    # remove identifying vars
    -block_group,
    -intptlat,
    -intptlon,
    -gis_id,
    -globalid,
    -gid,
    -tract,
    -ward,
    -block_group_temp,
    # remove raw total pop var (custom pop density and other pop proportion
    # vars remain)
    -total_pop,
    # remove raw race vars (custom proportion vars remain)
    -race_white,
    -race_black,
    -race_ai_an,
    -race_asian,
    -race_nh_pi,
    -race_oth_mult,
    -race_eth_hisp,
    # remove raw ed vars (custom proportion vars remain)
    -ed_doct,
    -ed_prof_deg,
    -ed_master,
    -ed_bach,
    -ed_assoc,
    -ed_reg_hsd,
    -ed_ged,
    -ed_comb_hsd,
    -ed_comb_adv_deg,
    # remove duplicative housing data (custom proportion vars and pct var
    # remain)
    -total_hous_units,
    -occ_units,
    -vac_units,
    # remove duplicative land data (custom proportion / "pct" vars remain)
    -ends_with("_ac"),
    -total_sq_mi,
    -water_sq_mi,
    -wtr_pct,
    -aland,
    -awater,
    -utcac0620,
    -utcac1120,
    -utcac1520,
    -utcpct0620,
    -utcpct1120,
    -utc_pct_06,
    -utc_pct_11,
    -utc_pct_15,
    # remove pop_density_water_sq_mi due to INF value for most records 
    -pop_density_water_sq_mi) 

# Creating predictions dataset
dt_predictions <- bind_cols(
  dc_test,
  predict(object = dt_fitting, new_data = dc_test),
  predict(object = dt_fitting, new_data = dc_test, type = "prob")
) 

# Editing column names so they're readable (e.g., replacing the space in
# ".pred_not hotspot" with an underscore)
names(dt_predictions) <- names(dt_predictions) |>
  str_replace(" ", "_")

# Printing a few predictions
dt_predictions |>
  select(hotspot, .pred_class, .pred_hotspot, .pred_not_hotspot) |>
  print(n = 20)
# A tibble: 116 × 4
   hotspot     .pred_class .pred_hotspot .pred_not_hotspot
   <chr>       <fct>               <dbl>             <dbl>
 1 not hotspot not hotspot         0.116             0.884
 2 not hotspot not hotspot         0.116             0.884
 3 not hotspot not hotspot         0.116             0.884
 4 not hotspot not hotspot         0.116             0.884
 5 not hotspot not hotspot         0.375             0.625
 6 not hotspot not hotspot         0.116             0.884
 7 not hotspot not hotspot         0.375             0.625
 8 not hotspot not hotspot         0.116             0.884
 9 not hotspot not hotspot         0.116             0.884
10 not hotspot not hotspot         0.116             0.884
11 not hotspot not hotspot         0.116             0.884
12 not hotspot not hotspot         0.116             0.884
13 not hotspot not hotspot         0.116             0.884
14 not hotspot not hotspot         0.263             0.737
15 not hotspot not hotspot         0.116             0.884
16 not hotspot not hotspot         0.116             0.884
17 not hotspot not hotspot         0.116             0.884
18 not hotspot not hotspot         0.116             0.884
19 not hotspot not hotspot         0.116             0.884
20 not hotspot not hotspot         0.116             0.884
# ℹ 96 more rows

Creating confusion matrix

# Transforming the hotspot variable to become a factor variable
dt_predictions$hotspot <- as.factor(dt_predictions$hotspot)

# Creating confusion matrix
conf_mat(data = dt_predictions,
         truth = hotspot,
         estimate = .pred_class)
             Truth
Prediction    hotspot not hotspot
  hotspot          16          13
  not hotspot       1          86

Calculating accuracy, precision, and recall/sensitivity:

accuracy(data = dt_predictions,
          truth = hotspot,
          estimate = .pred_class)
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.879
precision(data = dt_predictions,
          truth = hotspot,
          estimate = .pred_class)
# A tibble: 1 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 precision binary         0.552
recall(data = dt_predictions,
     truth = hotspot,
     estimate = .pred_class)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 recall  binary         0.941
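As a sanity check, all three metrics can be recomputed by hand from the confusion matrix above, treating “hotspot” as the positive class:

```r
# From the confusion matrix: 16 true positives, 13 false positives,
# 1 false negative, 86 true negatives
tp <- 16; fp <- 13; fn <- 1; tn <- 86

accuracy_hand  <- (tp + tn) / (tp + fp + fn + tn)  # (16 + 86) / 116
precision_hand <- tp / (tp + fp)                   # 16 / 29
recall_hand    <- tp / (tp + fn)                   # 16 / 17

round(c(accuracy_hand, precision_hand, recall_hand), 3)
# 0.879 0.552 0.941 -- matching the yardstick output above
```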

Linear Regression

JAMIE 12/12 - Added recipe for lm and rf, and added the lm model and fit below. Having trouble with this major warning: “prediction from rank-deficient fit; consider predict(., rankdeficient = "NA")”. Also, I experimented here with removing all the columns that seemed duplicative. I did not add this to the dt recipe for now. Let me know what you all think of removing these.

STEPHANIE 12/13 - The main issue is the columns that have NA values. We need to decide whether to remove them or whether we can impute the missing values. I removed the identifying factor variables because I had to remove them earlier from dc_train.

# Create recipe for use in linear regression and random forest models
# Create the recipe
dc_recipe <- recipe(formula = uhi ~ ., data = dc_train) |>
  # remove duplicative hotspot 
  step_rm(hotspot) |>
  # Removing predictors that have near zero variability
  step_nzv(all_predictors()) |>
  # Removing predictors that are highly correlated with others
  step_corr(all_predictors())
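A small sketch of the idea behind step_corr() (a simplified correlation filter, not recipes’ exact algorithm; the variables below are made up):

```r
# x2 is nearly a copy of x1; x3 is independent noise
set.seed(1)
x1 <- rnorm(200)
df <- data.frame(x1 = x1,
                 x2 = x1 + rnorm(200, sd = 0.05),
                 x3 = rnorm(200))

# flag any predictor whose absolute correlation with another exceeds 0.9
cor_mat <- abs(cor(df))
diag(cor_mat) <- 0
too_correlated <- colnames(df)[apply(cor_mat, 2, max) > 0.9]

too_correlated  # "x1" "x2": one of the pair would be dropped
```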

Modeling

# Linear regression model
dc_lm_model <- 
  linear_reg() |>
  set_mode(mode = "regression") |>
  set_engine(engine = "lm")

# Linear regression workflow
dc_lm_workflow <- 
  workflow() |>
  add_recipe(recipe = dc_recipe) |>
  add_model(spec = dc_lm_model)
  
unloadNamespace("Metrics")

dc_lm_resamples <- dc_lm_workflow |>
  fit_resamples(resamples = dc_folds)

STEPHANIE 12/15 - I took out fitting the model because we only really have to fit the best model between linear regression and random forests, since the purpose of fitting the model is to then create predictions from it, and we’d only want to create predictions based on the best model.

Random Forests

Modeling

Hyperparameter Tuning

dc_rf_model <- 
  rand_forest(
  trees = 200,
  mtry = tune(),
  min_n = tune()) |>
  set_mode(mode = "regression") |>
  set_engine(
    engine = "ranger", 
    importance = "impurity",
    num.threads = 4
  )

# Create a parameter grid
dc_rf_grid <- grid_regular(mtry(range = c(1, 15)),
                           min_n(range = c(1, 15)),
                           levels = 5)

# Random forest workflow
dc_rf_workflow <- 
  workflow() |>
  add_recipe(dc_recipe) |>
  add_model(dc_rf_model)

# Tuning the grid
rf_resamples <- tune_grid(
  dc_rf_workflow,
  resamples = dc_folds,
  grid = dc_rf_grid
)

# Showing the best hyperparameters
show_best(rf_resamples, metric = "rmse")
# A tibble: 5 × 8
   mtry min_n .metric .estimator  mean     n std_err .config              
  <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1    15     8 rmse    standard   0.890    10  0.0338 Preprocessor1_Model15
2    15     1 rmse    standard   0.891    10  0.0317 Preprocessor1_Model05
3     8     4 rmse    standard   0.893    10  0.0324 Preprocessor1_Model08
4    11     1 rmse    standard   0.893    10  0.0329 Preprocessor1_Model04
5     8     1 rmse    standard   0.893    10  0.0315 Preprocessor1_Model03
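For reference, grid_regular() with levels = 5 takes five evenly spaced values per parameter, so the grid has 25 candidate combinations. Assuming that even spacing over the range [1, 15], the candidate values can be reproduced (and they match the mtry/min_n values appearing in the table above):

```r
# levels = 5 over the range [1, 15]: five evenly spaced integer values
mtry_values <- unique(floor(seq(1, 15, length.out = 5)))

mtry_values             # 1 4 8 11 15
length(mtry_values)^2   # 25 candidate (mtry, min_n) combinations
```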

STEPHANIE 12/15 - I took out the kable() call from the line of code where we make dc_rf_grid because we don’t really need to see the grid at that point; what matters is the last line of code, which shows the best hyperparameters. Per the table, mtry = 15 and min_n = 8 are the best options.

Implementing best hyperparameters based on tuning

# Filling in hyperparameters based on tuning
dc_rf_model <- rand_forest(
  trees = 200,
  mtry = 15,
  min_n = 8
) |>
  set_mode(mode = "regression") |>
  set_engine(
    engine = "ranger", 
    importance = "impurity",
    num.threads = 4
  )

# Random forest workflow
dc_rf_workflow <- 
  workflow() |>
  add_recipe(dc_recipe) |>
  add_model(dc_rf_model)

rf_resamples <- dc_rf_workflow |>
  fit_resamples(resamples = dc_folds)

Comparing Models

Plot the RMSEs across each fit:

JAMIE 12/12 - Below is what I used in the assignment 07 hw to pull the metrics in the end, so pasted here and updated a bit. It renders!

STEPHANIE 12/15 - I edited the code to be a bit simpler (also based on what I submitted for assignment 07)

# Linear Regression: Calculating and plotting the RMSE for each resample
collect_metrics(
  dc_lm_resamples, 
  summarize = FALSE) |>
  filter(.metric == "rmse") |>
  print() |>
  ggplot(mapping = aes(x = id, y = .estimate, group = .metric)) +
  geom_point() +
  geom_line()
# A tibble: 10 × 5
   id     .metric .estimator .estimate .config             
   <chr>  <chr>   <chr>          <dbl> <chr>               
 1 Fold01 rmse    standard    0.000411 Preprocessor1_Model1
 2 Fold02 rmse    standard    0.000590 Preprocessor1_Model1
 3 Fold03 rmse    standard    0.216    Preprocessor1_Model1
 4 Fold04 rmse    standard    0.000650 Preprocessor1_Model1
 5 Fold05 rmse    standard    0.000578 Preprocessor1_Model1
 6 Fold06 rmse    standard    0.000634 Preprocessor1_Model1
 7 Fold07 rmse    standard    0.000628 Preprocessor1_Model1
 8 Fold08 rmse    standard    0.174    Preprocessor1_Model1
 9 Fold09 rmse    standard    0.000629 Preprocessor1_Model1
10 Fold10 rmse    standard    0.000708 Preprocessor1_Model1

# Random Forests: Calculating and plotting the RMSE for each resample
rf_resamples |>
  collect_metrics(summarize = FALSE) |>
  filter(.metric == "rmse") |>
  print() |>
  ggplot(mapping = aes(x = id, y = .estimate, group = .metric)) +
  geom_point() +
  geom_line()
# A tibble: 10 × 5
   id     .metric .estimator .estimate .config             
   <chr>  <chr>   <chr>          <dbl> <chr>               
 1 Fold01 rmse    standard       0.786 Preprocessor1_Model1
 2 Fold02 rmse    standard       0.942 Preprocessor1_Model1
 3 Fold03 rmse    standard       0.795 Preprocessor1_Model1
 4 Fold04 rmse    standard       0.884 Preprocessor1_Model1
 5 Fold05 rmse    standard       0.811 Preprocessor1_Model1
 6 Fold06 rmse    standard       0.755 Preprocessor1_Model1
 7 Fold07 rmse    standard       0.999 Preprocessor1_Model1
 8 Fold08 rmse    standard       0.855 Preprocessor1_Model1
 9 Fold09 rmse    standard       1.01  Preprocessor1_Model1
10 Fold10 rmse    standard       1.00  Preprocessor1_Model1

# Comparing average rmse
bind_rows(
  "Linear regression" = collect_metrics(dc_lm_resamples) |>
    filter(.metric == "rmse"), 
  'Random forest' = collect_metrics(rf_resamples) |>
    filter(.metric == "rmse"),
  .id = "model"
)
# A tibble: 2 × 7
  model             .metric .estimator   mean     n std_err .config             
  <chr>             <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
1 Linear regression rmse    standard   0.0394    10  0.0261 Preprocessor1_Model1
2 Random forest     rmse    standard   0.885     10  0.0314 Preprocessor1_Model1
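The summarized values can be verified directly from the per-fold estimates printed above: the mean and standard error of the ten linear-regression fold RMSEs reproduce the table’s figures (up to rounding of the printed fold values).

```r
# Per-fold linear regression RMSEs copied from the output above
lm_fold_rmse <- c(0.000411, 0.000590, 0.216, 0.000650, 0.000578,
                  0.000634, 0.000628, 0.174, 0.000629, 0.000708)

mean(lm_fold_rmse)                               # ~0.0395, vs. 0.0394 in the table
sd(lm_fold_rmse) / sqrt(length(lm_fold_rmse))    # ~0.0261, matching the std_err
```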

STEPHANIE 12/15 - Based on this, it seems that linear regression is the better model over random forests by a fairly large margin, which I find interesting since I think random forest models usually outperform other models like lm and regression trees?

Estimate the out-of-sample error rate

JAMIE 12/13 - Added metrics work below.

STEPHANIE 12/15 - Sorry for my ignorance, but I’m not entirely sure what we’re doing here - is it finding another metric to compare models? Based on the RMSE values above, it seems we should go with the regression model, then create predictions with it and visualize the best predictors.

MADELEINE 12/15 - Also, this chunk does not run because dc_lm_fit is not defined. Setting it to not evaluate for now.

library(Metrics)

# Fit the linear regression workflow on the full training data.
# (fit_best() expects tuning results; the lm workflow has no tuned
# parameters, so we fit the workflow directly.)
best_lm_fit <- fit(dc_lm_workflow, data = dc_train)

# make predictions using the fitted workflow
test_predictions <-
  dc_test |>
  select(uhi) |>
  rename(y = uhi) |>
  mutate(y = as.numeric(y)) |>
  mutate(y_hat = as.numeric(as_vector(predict(best_lm_fit, dc_test))))

# out-of-sample RMSE on the test set
kable(rmse(actual = test_predictions$y, predicted = test_predictions$y_hat))

Implement the Final Model

STEPHANIE 12/13 - Setting up the final code block for when we decide which model to make predictions from. This probably won’t render because the random forest model isn’t working at the time I’m writing this (7pm).

MADELEINE 12/15 - Not sure if more needs to be done here, but when I just replace the argument of extract_workflow with dc_lm_model, I get two errors: 1) select(-hotspot): hotspot does not exist; 2) no applicable method for ‘extract_workflow’ applied to an object of class “c(‘linear_reg’, ‘model_spec’)” (extract_workflow() needs a fitted or tuned result, not a bare model spec). Setting to not evaluate for now.

# Removing hotspot variable to be consistent with the recipe 
dc_test <- dc_test |>
  select(-hotspot)

pred_test <- bind_cols(
  dc_test,
  # replace `best_fit` below with the best-fitting model result
  predict(object = extract_workflow(best_fit),
          new_data = dc_test)
)

select(pred_test, uhi, starts_with(".pred")) |>
  print(n = 20)

MADELEINE 12/15: setting this to not evaluate for now since it’s blocking rendering.

Visualizing the most important predictors

# replace `best_fit` below with the best fitted model
best_fit |>
  extract_fit_parsnip() |>
  vip(num_features = 20) %>%
  .$data |>
  mutate(
    Importance = Importance / max(Importance),
    Variable = fct_reorder(.f = Variable, .x = Importance)
  ) |>
  ggplot(aes(Importance, Variable)) +
  geom_col()
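The mutate() step above rescales importances so the top predictor equals 1; for example, with made-up impurity scores:

```r
# Hypothetical raw impurity importances (assumed values for illustration)
importance <- c(utc_pct = 40, overall = 25, to_ia_pct = 10)

importance / max(importance)
#   utc_pct   overall to_ia_pct
#     1.000     0.625     0.250
```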

STEPHANIE 12/13 - Then with the code chunk below, we should make some maps with the top ~3 variables as determined by the code chunk above

Sources

https://www.climatecentral.org/climate-matters/urban-heat-islands-2024

https://opendata.dc.gov/datasets/DCGIS::urban-tree-canopy-by-census-block-group-in-2020/about

Census ACS 5yr 2016 - 2020

https://www2.census.gov/programs-surveys/acs/tech_docs/accuracy/MultiyearACSAccuracyofData2020.pdf

http://bit.ly/3DnlfIh